Classification using Dirichlet priors when the training data are mislabeled
Authors
Abstract
The average probability of error is used to demonstrate the performance of a Bayesian classification test, referred to as the Combined Bayes Test (CBT), when the training data of each class are mislabeled. The CBT combines the information in discrete training and test data to infer symbol probabilities, where a uniform Dirichlet prior (i.e., a noninformative prior of complete ignorance) is assumed for all classes. Using this prior, it is shown that classification performance degrades when mislabeling exists in the training data, with a severity that depends on the mislabeling probabilities. However, an increase in the mislabeling probabilities is also shown to cause an increase in M (i.e., the best quantization fineness). Further, even when the actual mislabeling probabilities are known by the CBT, it is not possible to achieve the classification performance obtainable without mislabeling.
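The effect described in the abstract can be explored with a small simulation. The sketch below is not the CBT itself (which combines training and test data when inferring symbol probabilities); it is a simplified plug-in classifier that assumes two classes, estimates each class's symbol probabilities as the posterior mean under a uniform Dirichlet prior, and flips each training label independently with probability `eps`. All function names and parameters here are hypothetical and chosen for illustration only.

```python
import math
import random

def posterior_mean(counts):
    """Posterior mean of symbol probabilities under a uniform Dirichlet prior.

    With all Dirichlet parameters equal to 1, the estimate for symbol i is
    (n_i + 1) / (N + M), where N is the total count and M the number of symbols.
    """
    total = sum(counts)
    m = len(counts)
    return [(c + 1) / (total + m) for c in counts]

def draw(probs, n, rng):
    """Draw n discrete symbols (indices) from the distribution probs."""
    return rng.choices(range(len(probs)), weights=probs, k=n)

def mislabeled_counts(p0, p1, n_per_class, eps, rng):
    """Per-label symbol counts when each training label is flipped with prob. eps."""
    m = len(p0)
    counts = [[0] * m, [0] * m]
    for true_class, probs in enumerate((p0, p1)):
        for symbol in draw(probs, n_per_class, rng):
            label = 1 - true_class if rng.random() < eps else true_class
            counts[label][symbol] += 1
    return counts

def classify(test_symbols, est0, est1):
    """Pick the class whose estimated probabilities give the higher log-likelihood."""
    ll0 = sum(math.log(est0[s]) for s in test_symbols)
    ll1 = sum(math.log(est1[s]) for s in test_symbols)
    return 0 if ll0 >= ll1 else 1

def error_rate(p0, p1, n_train, n_test, eps, trials, seed=0):
    """Monte Carlo estimate of the average probability of error (assumed setup)."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        counts = mislabeled_counts(p0, p1, n_train, eps, rng)
        est0, est1 = posterior_mean(counts[0]), posterior_mean(counts[1])
        true_class = rng.randrange(2)
        test = draw(p1 if true_class else p0, n_test, rng)
        errors += classify(test, est0, est1) != true_class
    return errors / trials
```

Calling `error_rate` for several values of `eps` with the same class distributions gives a rough picture of how the error estimate changes with the mislabeling probability in this simplified setting. It does not reproduce the paper's results or its analysis of the quantization fineness M.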
Similar resources
An algorithm for correcting mislabeled data
Reliable evaluation for the performance of classifiers depends on the quality of the data sets on which they are tested. During the collecting and recording of a data set, however, some noise may be introduced into the data, especially in various real-world environments, which can degrade the quality of the data set. In this paper, we present a novel approach, called ADE (automatic data enhance...
Improving Automated Land Cover Mapping by Identifying and Eliminating Mislabeled Observations from Training Data
This paper presents a new approach to identifying and eliminating mislabeled training samples. The goal of this technique is to decrease the error of classification algorithms by improving the quality of the training data. The approach employs an ensemble of classifiers that serve as a filter for the training data. Using an n-fold cross validation, the training data is passed through the filter...
Robustness of compound Dirichlet priors for Bayesian inference of branch lengths.
We modified the phylogenetic program MrBayes 3.1.2 to incorporate the compound Dirichlet priors for branch lengths proposed recently by Rannala, Zhu, and Yang (2012. Tail paradox, partial identifiability and influential priors in Bayesian branch length inference. Mol. Biol. Evol. 29:325-335.) as a solution to the problem of branch-length overestimation in Bayesian phylogenetic inference. The co...
A method for predicting disease subtypes in presence of misclassification among training samples using gene expression: application to human breast cancer
MOTIVATION: An accurate diagnostic and prediction will not be achieved unless the disease subtype status for every training sample used in the supervised learning step is accurately known. Such an assumption requires the existence of a perfect tool for disease diagnostic and classification, which is seldom available in the majority of the cases. Thus, the supervised learning step has to be condu...
Using Dirichlet Mixture Priors to Derive Hidden Markov Models for Protein Families
A Bayesian method for estimating the amino acid distributions in the states of a hidden Markov model (HMM) for a protein family or the columns of a multiple alignment of that family is introduced. This method uses Dirichlet mixture densities as priors over amino acid distributions. These mixture densities are determined from examination of previously constructed HMMs or multiple alignments. It ...
Publication date: 1999